In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt

plt.style.use('ggplot')
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

1. Import Data

In [2]:
OR_df = pd.read_excel("online_retail_II.xlsx", )
OR_df.head()
Out[2]:
Invoice StockCode Description Quantity InvoiceDate Price Customer ID Country
0 489434 85048 15CM CHRISTMAS GLASS BALL 20 LIGHTS 12 2009-12-01 07:45:00 6.95 13085.0 United Kingdom
1 489434 79323P PINK CHERRY LIGHTS 12 2009-12-01 07:45:00 6.75 13085.0 United Kingdom
2 489434 79323W WHITE CHERRY LIGHTS 12 2009-12-01 07:45:00 6.75 13085.0 United Kingdom
3 489434 22041 RECORD FRAME 7" SINGLE SIZE 48 2009-12-01 07:45:00 2.10 13085.0 United Kingdom
4 489434 21232 STRAWBERRY CERAMIC TRINKET BOX 24 2009-12-01 07:45:00 1.25 13085.0 United Kingdom
In [3]:
OR_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525461 entries, 0 to 525460
Data columns (total 8 columns):
Invoice        525461 non-null object
StockCode      525461 non-null object
Description    522533 non-null object
Quantity       525461 non-null int64
InvoiceDate    525461 non-null datetime64[ns]
Price          525461 non-null float64
Customer ID    417534 non-null float64
Country        525461 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 32.1+ MB

Checking completeness, we find that some records are missing the Customer ID. Because this analysis focuses on customers, we remove those rows.

2. Data Cleaning

Remove missing values

In [4]:
df = OR_df[OR_df['Customer ID'].notna()].reset_index(drop=True)
df.rename(columns={'Customer ID': 'CustomerID'}, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 417534 entries, 0 to 417533
Data columns (total 8 columns):
Invoice        417534 non-null object
StockCode      417534 non-null object
Description    417534 non-null object
Quantity       417534 non-null int64
InvoiceDate    417534 non-null datetime64[ns]
Price          417534 non-null float64
CustomerID     417534 non-null float64
Country        417534 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 25.5+ MB
In [5]:
# delete test records
df = df[~df['Description'].str.contains('test')]
# exclude returns (negative quantities)
df = df[df['Quantity']>0]

Standardize attributes

Make sure InvoiceDate is stored as a datetime (a string is not convenient for later analysis) and cast Invoice to an integer.

In [6]:
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['Invoice'] = df['Invoice'].astype('int')

Add columns: Month and TotalPrice (Price × Quantity).

In [7]:
df['Month'] = df.InvoiceDate.values.astype('datetime64[M]')
# df['Day'] = df.InvoiceDate.dt.day
df['TotalPrice'] = df['Quantity']*df['Price']
In [8]:
df['Month'] = df['Month'].astype('str').apply(lambda x: x[:-3])
df.head()
Out[8]:
Invoice StockCode Description Quantity InvoiceDate Price CustomerID Country Month TotalPrice
0 489434 85048 15CM CHRISTMAS GLASS BALL 20 LIGHTS 12 2009-12-01 07:45:00 6.95 13085.0 United Kingdom 2009-12 83.4
1 489434 79323P PINK CHERRY LIGHTS 12 2009-12-01 07:45:00 6.75 13085.0 United Kingdom 2009-12 81.0
2 489434 79323W WHITE CHERRY LIGHTS 12 2009-12-01 07:45:00 6.75 13085.0 United Kingdom 2009-12 81.0
3 489434 22041 RECORD FRAME 7" SINGLE SIZE 48 2009-12-01 07:45:00 2.10 13085.0 United Kingdom 2009-12 100.8
4 489434 21232 STRAWBERRY CERAMIC TRINKET BOX 24 2009-12-01 07:45:00 1.25 13085.0 United Kingdom 2009-12 30.0

3. Analysis of consumer consumption trends

3.1 Monthly Total Purchase Amount

In [9]:
monthly_amount = df.groupby('Month')['TotalPrice'].sum()
monthly_amount.head(5)
Out[9]:
Month
2009-12    686540.660
2010-01    557229.062
2010-02    506371.066
2010-03    699608.991
2010-04    594586.692
Name: TotalPrice, dtype: float64
In [10]:
plt.figure(figsize=(10, 5))
plt.plot(monthly_amount)
plt.title('Monthly Total Amount')
Out[10]:
Text(0.5, 1.0, 'Monthly Total Amount')

The amount rises through the year, peaks in November, then falls back.

3.2 Monthly Visit

In [11]:
monthly_inv = df.groupby('Month')['Invoice'].nunique()
monthly_inv.head()
Out[11]:
Month
2009-12    1507
2010-01    1007
2010-02    1104
2010-03    1524
2010-04    1328
Name: Invoice, dtype: int64
In [12]:
plt.figure(figsize=(10,5))
plt.plot(monthly_inv)
plt.title('Monthly Invoice Number')
Out[12]:
Text(0.5, 1.0, 'Monthly Invoice Number')

3.3 Monthly Sales of Product

In [13]:
monthly_sales = df.groupby('Month')['Quantity'].sum()
monthly_sales.head()
Out[13]:
Month
2009-12    400175
2010-01    370947
2010-02    372761
2010-03    503467
2010-04    352037
Name: Quantity, dtype: int64
In [14]:
plt.figure(figsize=(10,5))
plt.plot(monthly_sales)
plt.title('Monthly Sales of Product')
Out[14]:
Text(0.5, 1.0, 'Monthly Sales of Product')

Invoice counts move together with total amount, which is expected. Why does this trend happen? Promotions are one likely cause; another significant driver is Black Friday in November.

3.4 Monthly Consumers

In [15]:
monthly_consumers = df.groupby('Month')['CustomerID'].nunique()
monthly_consumers.head()
Out[15]:
Month
2009-12     954
2010-01     719
2010-02     772
2010-03    1057
2010-04     942
Name: CustomerID, dtype: int64
In [16]:
plt.figure(figsize=(10, 5))
plt.plot(monthly_consumers)
plt.title('Monthly Consumers')
Out[16]:
Text(0.5, 1.0, 'Monthly Consumers')

In every month the number of distinct consumers is lower than the number of invoices, which means buyers return within the month, so the rate of buyer retention is not bad. In the first three quarters the monthly consumer count stays between 1,100 and 1,700; in the last quarter the number rises, peaking at 2,630.
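As a quick sanity check on that within-month repeat purchasing, invoices per month can be divided by distinct customers per month; a minimal sketch on invented data (column names mirror the cleaned df, the rows are made up):

```python
import pandas as pd

# Toy transactions: in 2010-01, customer 100 has two invoices, so the ratio exceeds 1
toy = pd.DataFrame({
    'Month':      ['2010-01', '2010-01', '2010-01', '2010-02'],
    'Invoice':    [1, 2, 3, 4],
    'CustomerID': [100, 100, 200, 200],
})

# Invoices per customer per month: > 1 means some customers bought more than once
visits_per_customer = (toy.groupby('Month')['Invoice'].nunique()
                       / toy.groupby('Month')['CustomerID'].nunique())
print(visits_per_customer.to_dict())  # {'2010-01': 1.5, '2010-02': 1.0}
```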

4. Individual Consumption Analysis

4.1 Average Number of Items Purchased Each Month

In [17]:
# show: how many purchases each customer makes per month
# df_monthly_c_t = df.groupby(['Month', 'CustomerID'])['Invoice'].count()

avg_purch_monthly = df.groupby(['Month'])['Quantity'].sum()/df.groupby(['Month'])['CustomerID'].nunique()
avg_purch_monthly.head()
Out[17]:
Month
2009-12    419.470650
2010-01    515.920723
2010-02    482.851036
2010-03    476.316935
2010-04    373.712314
dtype: float64
In [18]:
plt.figure(figsize=(10, 5))
plt.plot(avg_purch_monthly)
plt.title('Average Number of Items Purchased Each Month')
Out[18]:
Text(0.5, 1.0, 'Average Number of Items Purchased Each Month')

4.2 Individual Visits and Amount

In [19]:
df_ind = df.groupby('CustomerID').agg({'Invoice':'nunique', 'Quantity':'sum','TotalPrice':'sum'})
df_ind.rename(columns={'Invoice':'Visit', 'Quantity':'Sales', 'TotalPrice':'TotalAmount'}, inplace=True)
df_ind.describe()
Out[19]:
Visit Sales TotalAmount
count 4312.000000 4312.000000 4312.000000
mean 4.453386 1284.592532 2048.185824
std 8.169775 6459.893518 8914.491881
min 1.000000 1.000000 1.550000
25% 1.000000 158.000000 307.950000
50% 2.000000 382.000000 706.020000
75% 5.000000 996.250000 1723.142500
max 205.000000 220600.000000 349164.350000
  1. Buyers visit the store 4.45 times on average, with a maximum of 205 visits.
  2. Buyers purchase 1,285 items on average; the maximum is 220,600 items.
  3. Buyers spend £2,048 on average, with a standard deviation of 8,914, so spending fluctuates greatly.
  4. The majority of customers visit fewer than 5 times, buy fewer than 1,000 items, and spend less than £1,723.
  5. Since the means sit close to the third quartiles while the maxima are far larger, a small number of high-volume consumers must be present.
In [20]:
plt.figure(figsize=(15,10))
plt.scatter(x=df_ind['Visit'], y=df_ind['TotalAmount'], alpha=0.3)
plt.title('Visits vs. Total Amount')
plt.xlabel('Visit')
plt.ylabel('Amount')
Out[20]:
Text(0, 0.5, 'Amount')

According to the scatter plot, most points are concentrated near the origin; the outlying points may be wholesalers.

4.3 Consumption Amount

By Chebyshev's inequality, at most 1/9 ≈ 11% of customers lie more than three standard deviations above the mean, so the bound is 2048 + 3 × 8914 ≈ £28,790. (Setting the plot limit at £15,000 is more convenient.)
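The bound works out as mean + 3σ, with Chebyshev guaranteeing that at most 1/3² of customers exceed three standard deviations from the mean; a small arithmetic sketch using the mean and standard deviation from the describe() table above:

```python
# Mean and std of TotalAmount, taken from df_ind.describe()
mean, std = 2048.19, 8914.49
k = 3
upper = mean + k * std          # upper bound on "typical" spend
tail_fraction = 1 / k**2        # Chebyshev: P(|X - mean| >= k*std) <= 1/k^2
print(round(upper, 2), round(tail_fraction, 3))  # 28791.66 0.111
```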

In [21]:
plt.figure(figsize=(10,5))
plt.hist(df_ind.query('TotalAmount<15000').TotalAmount, bins=40)
plt.xlabel('Amount')
plt.ylabel('Count')
plt.title('Consumption Amount')
Out[21]:
Text(0.5, 1.0, 'Consumption Amount')

The histogram has a long right tail; most customers spend less than £2,000.

4.4 Customer Consumption Cycle

Purchases on the same invoice at the same time are treated as one visit.
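The gap computation can be illustrated with Series.diff on a toy visit sequence (hypothetical dates; the notebook's version uses shift inside a groupby, which is equivalent per customer):

```python
import pandas as pd

# Three visits by one customer (made-up dates)
visits = pd.Series(pd.to_datetime(['2010-01-01', '2010-01-15', '2010-03-01']))

# Gap to the previous visit in days; the first visit has no predecessor, hence NaN
gaps = visits.diff() / pd.Timedelta(days=1)
print(gaps.tolist())  # [nan, 14.0, 45.0]
```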

In [22]:
# Deduplicate by Invoice first: one invoice counts as one visit. Cycle = Visit Date - Previous Visit Date; the first visit has no predecessor, so it returns NaT.
cycle = df.drop_duplicates(subset=['CustomerID', 'Invoice']).groupby('CustomerID').apply(lambda x:x.InvoiceDate-x.InvoiceDate.shift())
df_cycle = cycle.reset_index().rename(columns={'level_1':'Invoice'}).set_index('Invoice')
# convert the time difference to days
df_cycle['Period'] = df_cycle['InvoiceDate']/np.timedelta64(1, 'D')
# treat multiple visits within half a day as a single visit
df_cycle = df_cycle[~(df_cycle['Period']<=0.5)]
df_cycle.describe()
Out[22]:
CustomerID InvoiceDate Period
count 16993.000000 12681 12681.000000
mean 15330.181722 45 days 15:58:55.708540 45.665923
std 1699.662760 52 days 18:58:30.502193 52.790631
min 12346.000000 0 days 13:23:00 0.557639
25% 13869.000000 11 days 00:51:00 11.035417
50% 15291.000000 27 days 22:03:00 27.918750
75% 16794.000000 57 days 23:47:00 57.990972
max 18287.000000 363 days 21:19:00 363.888194
In [23]:
plt.figure(figsize=(10, 5))
plt.hist(df_cycle['Period'].dropna(), bins=30)
plt.xlabel('Time period')
plt.ylabel('Frequency')
plt.title('Consumption Cycle Distribution')
Out[23]:
Text(0.5, 1.0, 'Consumption Cycle Distribution')

The average repurchase interval is 45 days, and 75% of customers repurchase within 58 days. To improve loyalty, intervening between days 30 and 40 is a good idea.

4.5 Customer Consumption Lifetime

Lifetime: the time from a customer's first purchase to their last.
A lifetime of 0 means the customer purchased only once.

In [24]:
df_csp_life = df.groupby('CustomerID')['InvoiceDate'].apply(lambda x: x.max()-x.min())
life = df_csp_life/np.timedelta64(1, 'D')
life.describe()
Out[24]:
count    4312.000000
mean      134.306095
std       132.997690
min         0.000000
25%         0.000000
50%       105.505556
75%       254.050174
max       373.222917
Name: InvoiceDate, dtype: float64

This lifetime analysis includes customers who purchased only once.

In [25]:
life_except_first = life[life>0]
life_except_first.describe()
Out[25]:
count    2896.000000
mean      199.975097
std       114.906105
min         0.000694
25%       103.632639
50%       209.938194
75%       301.226910
max       373.222917
Name: InvoiceDate, dtype: float64

This lifetime analysis excludes customers who purchased only once.

In [26]:
plt.figure(figsize=(10, 5))
plt.hist(life_except_first, bins=80)
plt.xlabel('Lifetime')
plt.ylabel('Frequency')
plt.title('Customer Consumption Lifetime Distribution')
Out[26]:
Text(0.5, 1.0, 'Customer Consumption Lifetime Distribution')

The chart shows a large group of customers who consumed only once; the rest cluster around 50-300 days, and those beyond 300 days are high-quality customers.
Excluding one-time buyers, the average lifetime is 199.8 days versus 134.3 days for all customers. Combined with the consumption cycle (4.4), first-time buyers should therefore be nudged within 30-40 days to extend their consumption lifetime.

5. Customer Segmentation

5.1 Build the RFM Model

Recency: days since the last purchase
Frequency: total items purchased (visit count would also work)
Monetary: total amount spent
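For reference, the same three metrics can be computed with a single groupby aggregation instead of pivot_table; a minimal sketch on invented transactions (column names mirror the cleaned df, the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'CustomerID':  [1, 1, 2],
    'InvoiceDate': pd.to_datetime(['2010-01-01', '2010-03-01', '2010-02-01']),
    'Quantity':    [5, 3, 10],
    'TotalPrice':  [50.0, 30.0, 200.0],
})

rfm = toy.groupby('CustomerID').agg(
    last_purchase=('InvoiceDate', 'max'),  # becomes Recency once subtracted from the reference date
    F=('Quantity', 'sum'),
    M=('TotalPrice', 'sum'),
)
# Recency in days, measured from the most recent purchase in the data
rfm['R'] = (rfm['last_purchase'].max() - rfm['last_purchase']).dt.days
print(rfm[['R', 'F', 'M']])
```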

In [27]:
rfm = df.pivot_table(values = ['InvoiceDate', 'Quantity', 'TotalPrice'], 
                    index='CustomerID',
                    aggfunc={'InvoiceDate': 'max',
                            'Quantity': 'sum',
                            'TotalPrice': 'sum'})
rfm['InvoiceDate'] = (rfm.InvoiceDate.max() - rfm.InvoiceDate)/np.timedelta64(1, 'D')
rfm.rename(columns={'InvoiceDate':'R', 'Quantity':'F', 'TotalPrice':'M'}, inplace=True)
rfm.describe()
Out[27]:
R F M
count 4312.000000 4312.000000 4312.000000
mean 90.454144 1284.592532 2048.185824
std 96.866467 6459.893518 8914.491881
min 0.000000 1.000000 1.550000
25% 17.374653 158.000000 307.950000
50% 52.143403 382.000000 706.020000
75% 135.422222 996.250000 1723.142500
max 373.420833 220600.000000 349164.350000

So far we have designed RFM values to evaluate customers, but we found the distributions are heavily skewed by a few very large values.

Log-transform to Reduce the Distortion Caused by Large Values
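To see why the log helps, compare skewness before and after on a made-up right-skewed series (the numbers are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up amounts with a long right tail, mimicking the shape of the M column
amounts = pd.Series([10.0, 20.0, 40.0, 80.0, 200.0, 1000.0, 50000.0])
logged = np.log(amounts)

# The log transform pulls the outlier in, so the skewness drops sharply
print(round(amounts.skew(), 2), round(logged.skew(), 2))
```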

In [28]:
rfm = rfm[rfm != 0].dropna() # drop zero values, since log(0) is undefined
log_rfm = np.log(rfm)
log_rfm.describe()
Out[28]:
R F M
count 4311.000000 4311.000000 4311.000000
mean 3.728458 5.952569 6.611980
std 1.530519 1.466513 1.286441
min -3.905103 0.000000 0.438255
25% 2.855372 5.062595 5.729937
50% 3.954044 5.945421 6.559119
75% 4.908397 6.904249 7.451964
max 5.922706 12.304106 12.763298
In [29]:
import seaborn as sns
sns.set_style('ticks')
# pd.plotting.scatter_matrix(log_rfm, alpha = 0.3, figsize = (10,10), diagonal = 'kde')
sns.pairplot(log_rfm, height=5, diag_kind="kde")
Out[29]:
<seaborn.axisgrid.PairGrid at 0x12435c978>

5.2 k-Means Clustering

We use the KMeans clustering method to divide customers into 8 groups.

In [30]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=8, random_state=0).fit(log_rfm)
label = kmeans.labels_
labeled_rfm = log_rfm.copy()
labeled_rfm['Label'] = label
labeled_rfm.head()
Out[30]:
R F M Label
CustomerID
12346.0 5.101423 3.178054 5.132027 4
12347.0 0.793495 6.719013 7.187899 7
12348.0 4.293328 5.921578 5.403398 1
12349.0 3.749145 6.900731 7.890261 2
12351.0 2.321707 5.564520 5.706878 3

6. Describe Each Class

Customers are grouped into 8 classes. Based on the features of each label, we evaluate the different customer groups.

6.1 Customer Scatter Matrix

In [31]:
labeled_rfm.describe()
Out[31]:
R F M Label
count 4311.000000 4311.000000 4311.000000 4311.000000
mean 3.728458 5.952569 6.611980 3.336349
std 1.530519 1.466513 1.286441 2.142612
min -3.905103 0.000000 0.438255 0.000000
25% 2.855372 5.062595 5.729937 1.000000
50% 3.954044 5.945421 6.559119 3.000000
75% 4.908397 6.904249 7.451964 5.000000
max 5.922706 12.304106 12.763298 7.000000
In [32]:
sns.pairplot(labeled_rfm, hue='Label', height=5)
Out[32]:
<seaborn.axisgrid.PairGrid at 0x12c501ef0>

The scatter matrix shows that all users have been successfully split into 8 groups; our goal is to characterize each of these groups.

6.2 3D Scatter

In [33]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(20,15))
ax = fig.add_subplot(111, projection='3d')
color_list = ['red','cyan','green','blue','crimson','yellowgreen','purple','brown']
for i in range(8):
    g = labeled_rfm[labeled_rfm['Label']==(i)]
    ax.scatter(g['R'], g['F'], g['M'], c=color_list[i])

6.3 Discretization: Map RFM Values to Scores

Scores range from 1 to 5. For F and M, higher values are better and earn higher scores; for Recency it is the opposite: a smaller (more recent) value earns a higher score.
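The quantile binning implemented below can also be expressed with pd.qcut, which cuts a series into equal-frequency bins; a minimal sketch on toy values (not the notebook's data):

```python
import pandas as pd

# Toy values standing in for one RFM column
vals = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# F/M style: larger value -> higher score
fm_score = pd.qcut(vals, q=5, labels=[1, 2, 3, 4, 5]).astype(int)
# R style: smaller (more recent) value -> higher score
r_score = pd.qcut(vals, q=5, labels=[5, 4, 3, 2, 1]).astype(int)
print(fm_score.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
print(r_score.tolist())   # [5, 5, 4, 4, 3, 3, 2, 2, 1, 1]
```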

In [34]:
quantiles = labeled_rfm.quantile(q=[0.2, 0.4, 0.6, 0.8]).to_dict()
def r_class(x, category, df):
    """
    x:  data before descretization. 
    category: R, F, M
    df: dict of quantiles
    return: score of pressent R/F/M value
    """
    if x<=df[category][0.2]:
        return 5
    elif x<=df[category][0.4]:
        return 4
    elif x<=df[category][0.6]:
        return 3
    elif x<=df[category][0.8]:
        return 2
    else:
        return 1

def fm_class(x, category, df):
    if x<=df[category][0.2]:
        return 1
    elif x<=df[category][0.4]:
        return 2
    elif x<=df[category][0.6]:
        return 3
    elif x<=df[category][0.8]:
        return 4
    else:
        return 5

"""
RFM Map to Score
"""
rfm_Seg = labeled_rfm.copy()
rfm_Seg['R'] = labeled_rfm['R'].apply(r_class, args=('R', quantiles))
rfm_Seg['F'] = labeled_rfm['F'].apply(fm_class, args=('F', quantiles))
rfm_Seg['M'] = labeled_rfm['M'].apply(fm_class, args=('M', quantiles))
rfm_Seg.head()
Out[34]:
R F M Label
CustomerID
12346.0 2 1 1 4
12347.0 5 4 4 7
12348.0 2 3 1 1
12349.0 3 4 5 2
12351.0 5 2 2 3

Combined with the scatter matrix, we can analyze the relation between the labels and the RFM scores.

6.4 Scatter Matrix of Labeled RFM

In [35]:
graph = sns.PairGrid(rfm_Seg, 
                 x_vars = ['R', 'F', 'M', 'Label'],
                 y_vars = ['R', 'F', 'M', 'Label'],
                 hue='Label', height=3, palette="husl", hue_kws={'marker':["o", "s", "D", "o", "s", "D", "o", "s"]})
graph = graph.map_diag(plt.hist)
graph = graph.map_offdiag(sns.scatterplot, edgecolor='w', s=300, alpha=0.3)
graph.add_legend() # add color description
Out[35]:
<seaborn.axisgrid.PairGrid at 0x12d97c8d0>

6.5 Description Based on RFM Features

Using the characteristics of the RFM scores together with the scatter matrix, we describe each user group in detail.

In [36]:
label_avg_rfm = rfm_Seg.groupby('Label').agg({'R':'mean','F':'mean','M':'mean', 'Label':'count'})
label_avg_rfm.rename(columns={'Label':'Count'}, inplace=True)
label_avg_rfm = label_avg_rfm.round(2)
label_avg_rfm.describe()
Out[36]:
R F M Count
count 8.000000 8.000000 8.000000 8.000000
mean 3.221250 3.076250 3.115000 538.875000
std 1.384294 1.541918 1.507695 277.949861
min 1.370000 1.000000 1.230000 146.000000
25% 1.987500 1.647500 1.670000 267.750000
50% 3.285000 3.370000 3.355000 649.500000
75% 4.460000 4.295000 4.375000 702.750000
max 5.000000 5.000000 5.000000 912.000000
In [37]:
label_avg_rfm
Out[37]:
R F M Count
Label
0 4.46 5.00 5.00 276
1 2.02 2.98 2.90 912
2 2.77 4.55 4.51 633
3 3.80 1.74 1.77 666
4 1.89 1.00 1.23 146
5 4.46 3.76 3.81 688
6 1.37 1.37 1.37 747
7 5.00 4.21 4.33 243

0: Champion: Bought recently, buy often and spend the most!

7: Loyal Customers: Spend good money with us often. Responsive to promotions.

5: Potential Loyalist: Recent customers, but spent a good amount and bought more than once.

3: Promising: Recent shoppers, but haven’t spent much.

2: Can’t Lose Them: Spent big money and purchased often. But long time ago. Need to bring them back!

1: Customers Needing Attention: Around average recency, frequency and monetary values. May not have bought very recently though.

4: Lost: Lowest recency, frequency and monetary scores.

6: Hibernating: Last purchase was long back, low spenders and bought seldomly.

In [38]:
label_names = {
    0: 'Champion',
    1: 'Customers Needing Attention',
    2: 'Can’t Lose Them',
    3: 'Promising',
    4: 'Lost',
    5: 'Potential Loyalist',
    6: 'Hibernating',
    7: 'Loyal Customers',
}
plt.figure(figsize=(5, 5))
plt.pie(label_avg_rfm['Count'].to_list(), labels=label_names.values(), autopct='%1.1f%%', radius=1, shadow=True, startangle=90)
plt.title('All Labels Numbers')
Out[38]:
Text(0.5, 1.0, 'All Labels Numbers')

This pie chart shows the share of customers in each class.

7. Repurchase Rate

7.1 Customer Status for Each Month

For each month, the value is 1 if the customer made a purchase, else 0.

In [39]:
ind_monthly_visit = df.pivot_table(index='CustomerID', columns='Month', values='Invoice', aggfunc='nunique').fillna(0)
boolean_monthly_visit = ind_monthly_visit.applymap(lambda x: 1 if x>0 else 0)
boolean_monthly_visit.head()
Out[39]:
Month 2009-12 2010-01 2010-02 2010-03 2010-04 2010-05 2010-06 2010-07 2010-08 2010-09 2010-10 2010-11 2010-12
CustomerID
12346.0 0 0 0 1 0 0 1 0 0 0 0 0 0
12347.0 0 0 0 0 0 0 0 0 0 0 1 0 1
12348.0 0 0 0 0 0 0 0 0 0 1 0 0 0
12349.0 0 0 0 0 1 1 0 0 0 0 1 0 0
12351.0 0 0 0 0 0 0 0 0 0 0 0 1 0
In [40]:
from IPython.display import Image
Image('statusDecisionTree.jpg')
Out[40]:
New:      no purchase history before, and made a purchase this month.
Inactive: purchased before, but did not buy this month.
Return:   purchased before, was inactive last month, and bought again this month.
Unknown:  no purchase history and no purchase this month.

According to this decision tree, we can give a customer status for every month.

In [41]:
def check_status(record):
    status = []
    for i in range(13):
        if record[i] == 1:
            if len(status)==0:
                s='New'
            else:
                if status[i-1] == 'Unknown':
                    s = 'New'
                elif status[i-1] == 'Inactive':
                    s='Return'
                else:
                    s = 'Active'
        elif record[i] == 0:
            if len(status)==0:
                s='Unknown'
            else:
                if status[i-1] == 'Unknown':
                    s='Unknown'
                else:
                    s='Inactive'
        status.append(s)
    return pd.Series(status, index=record.index.astype('str').to_list())
ind_status_df = boolean_monthly_visit.apply(check_status, axis=1)
ind_status_df.head()
Out[41]:
2009-12 2010-01 2010-02 2010-03 2010-04 2010-05 2010-06 2010-07 2010-08 2010-09 2010-10 2010-11 2010-12
CustomerID
12346.0 Unknown Unknown Unknown New Inactive Inactive Return Inactive Inactive Inactive Inactive Inactive Inactive
12347.0 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown New Inactive Return
12348.0 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown New Inactive Inactive Inactive
12349.0 Unknown Unknown Unknown Unknown New Active Inactive Inactive Inactive Inactive Return Inactive Inactive
12351.0 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown New Inactive

7.2 Stackplot of Each Status

In [42]:
status_df = ind_status_df.apply(lambda x: x.value_counts()).fillna(0)
status_df.drop(['Unknown'], inplace=True)
status_df
Out[42]:
2009-12 2010-01 2010-02 2010-03 2010-04 2010-05 2010-06 2010-07 2010-08 2010-09 2010-10 2010-11 2010-12
Active 0.0 336.0 262 314 378 345 368 392 351 365 463 657 346.0
Inactive 0.0 618.0 939 1098 1507 1737 1932 2231 2410 2419 2444 2659 3739.0
New 954.0 383.0 374 444 294 254 270 186 162 243 377 325 46.0
Return 0.0 0.0 136 299 270 367 403 350 398 537 657 625 181.0
In [43]:
labels = ['Active', 'New', 'Return', 'Inactive']
x = status_df.columns.tolist()
plt.figure(figsize=(15, 5))
plt.stackplot(x, status_df.loc['Active'], status_df.loc['New'], status_df.loc['Return'], status_df.loc['Inactive'], labels=labels)
plt.legend(loc = 'upper left')
plt.xlabel('Month')
plt.ylabel('Count')
plt.title('Customer Status Stackplot')
Out[43]:
Text(0.5, 1.0, 'Customer Status Stackplot')

New: Lacking earlier data, the early months show many "new" customers; after that, the number is stable.
Active: From January the number is stable, peaking in November.
Return: The number rises slowly and also peaks in November.
Inactive: Rapid increase, which means many customers did not repurchase regularly.

7.3 Multiple Purchase and Repurchase Rate

7.3.1 Multiple Purchase Rate

Multiple purchase: the customer placed more than one order in a single month.
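The applymap lambda in the next cell encodes three states per month; spelled out on a toy series of invoice counts:

```python
import numpy as np
import pandas as pd

# Invoice counts for one customer across four months (made-up values)
counts = pd.Series([0, 1, 2, 5])

# 0 invoices -> NaN (did not shop), exactly 1 -> 0, more than 1 -> 1 (multiple purchase)
flags = counts.map(lambda x: 1 if x > 1 else np.nan if x == 0 else 0)
print(flags.tolist())  # [nan, 0.0, 1.0, 1.0]
```

With this encoding, sum()/count() over a month's column gives the multiple purchase rate among customers who shopped at all.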

In [44]:
multi_df = ind_monthly_visit.applymap(lambda x: 1 if x>1 else np.NaN if x==0 else 0)
rates = multi_df.sum()/multi_df.count()
rates
Out[44]:
Month
2009-12    0.298742
2010-01    0.204451
2010-02    0.226684
2010-03    0.224219
2010-04    0.213376
2010-05    0.236025
2010-06    0.246878
2010-07    0.274784
2010-08    0.240395
2010-09    0.235808
2010-10    0.237809
2010-11    0.327940
2010-12    0.188482
dtype: float64
In [45]:
plt.figure(figsize=(10, 5))
plt.plot(rates.index, rates)
plt.xlabel('Month')
plt.ylabel('Rate')
plt.title('Buyer Multiple Purchase Rate')
Out[45]:
Text(0.5, 1.0, 'Buyer Multiple Purchase Rate')

The multiple purchase rate ranges roughly from 19% to 33% over the year, peaking in November.

7.3.2 Repurchase Rate

Repurchase: a customer counts as a repurchase in a given month if they purchased both that month and the previous month.
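The month-over-month check can also be done vectorized: shift the 0/1 visit matrix one column to the right and combine it with the current month, keeping only months with a purchase; a minimal sketch on a toy matrix (made-up customers, columns assumed to be consecutive months):

```python
import pandas as pd

visits = pd.DataFrame(
    [[1, 1, 0, 1],
     [0, 1, 1, 1]],
    columns=['m1', 'm2', 'm3', 'm4'], index=['custA', 'custB'],
)

# 1 where the customer bought in the previous month
bought_last_month = visits.shift(axis=1).fillna(0).astype(int)
# 1 = repurchase, 0 = bought this month but not last month, NaN = no purchase this month
repurchase = (visits * bought_last_month).where(visits == 1)
rate = repurchase.sum() / repurchase.count()
print(rate.to_dict())  # {'m1': 0.0, 'm2': 0.5, 'm3': 1.0, 'm4': 0.5}
```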

In [46]:
Image('repurchaseDecisionTree.jpeg')
Out[46]:
In [47]:
def check_repurchase(record):
    status = [np.NaN]
    for i in range(1, 13):
        if record[i] == 1:
            if record[i-1]==1:
                s = 1
            else:
                s = 0
        else:
            s = np.NaN
        status.append(s)
    return pd.Series(status, index=record.index.astype('str').tolist())
repurchase_df = boolean_monthly_visit.apply(check_repurchase, axis=1)
repurchase_df.head()
Out[47]:
2009-12 2010-01 2010-02 2010-03 2010-04 2010-05 2010-06 2010-07 2010-08 2010-09 2010-10 2010-11 2010-12
CustomerID
12346.0 NaN NaN NaN 0.0 NaN NaN 0.0 NaN NaN NaN NaN NaN NaN
12347.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN 0.0
12348.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN
12349.0 NaN NaN NaN NaN 0.0 1.0 NaN NaN NaN NaN 0.0 NaN NaN
12351.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN

# Alternative (forward-looking) version: did this month's buyers come back next month?
def purchase_back(record):
    status = []
    for i in range(12):
        if record[i] == 1:
            if record[i+1] == 1:
                status.append(1)
            if record[i+1] == 0:
                status.append(0)
        else:
            status.append(np.NaN)
    status.append(np.NaN)
    return pd.Series(status, index=record.index.astype('str').to_list())
boolean_monthly_visit.apply(purchase_back, axis=1).sum()

In [48]:
repurchase_rate = repurchase_df.sum()/repurchase_df.count()
plt.figure(figsize=(10, 5))
plt.plot(repurchase_rate.index, repurchase_rate)
plt.xlabel('Month')
plt.ylabel('Rate')
plt.title('Buyer Repurchase Rate')
Out[48]:
Text(0.5, 1.0, 'Buyer Repurchase Rate')

The monthly repurchase rate is roughly 30% to 50% throughout the year, which is fairly high.

8. Customer Classification Summary

Use three attributes to describe each customer: total quantity, total visits, and total amount.

In [156]:
# For individual
ind_summary_df = df.pivot_table(index='CustomerID', values=['TotalPrice'], aggfunc={'TotalPrice':'sum'})
ind_summary_df['TotalQuantity'] = df.groupby(['CustomerID'])['Quantity'].sum()
ind_summary_df['TotalVisit'] = df.groupby(['CustomerID'])['Invoice'].nunique()
ind_summary_df['Label'] = labeled_rfm['Label'].astype('str')

# Group by Label
label_summary_df = ind_summary_df.groupby('Label').agg({'TotalPrice':'sum', 'TotalQuantity':'sum','TotalVisit':'sum', 'Label':'count'})
label_summary_df.rename(columns={'Label':'Count'}, inplace=True)
In [175]:
percentage_df = label_summary_df/label_summary_df.sum()
percentage_df
Out[175]:
TotalPrice TotalQuantity TotalVisit Count
Label
0 0.48 0.53 0.28 0.06
1 0.07 0.07 0.11 0.21
2 0.19 0.18 0.18 0.15
3 0.02 0.02 0.06 0.15
4 0.01 0.00 0.01 0.03
5 0.11 0.10 0.18 0.16
6 0.02 0.02 0.05 0.17
7 0.09 0.08 0.13 0.06
In [181]:
label_avg_df = label_summary_df[['Count']].copy()  # copy to avoid SettingWithCopyWarning
label_avg_df['Avg_Price'] = label_summary_df['TotalPrice']/label_summary_df['Count']
label_avg_df['Avg_Quantity'] = label_summary_df['TotalQuantity']/label_summary_df['Count']
label_avg_df['Avg_Visit'] = label_summary_df['TotalVisit']/label_summary_df['Count']
label_avg_df['Avg_Purchase_Value'] = label_avg_df['Avg_Price']/label_summary_df['TotalVisit']
label_avg_df.style.bar()
Out[181]:
Count Avg_Price Avg_Quantity Avg_Visit Avg_Purchase_Value
Label
0 276 15469.5 10724.5 19.3514 2.89637
1 912 725.461 421.988 2.40351 0.330959
2 633 2623.4 1588.62 5.60664 0.739193
3 666 330.152 167.054 1.66667 0.297434
4 146 347.82 13.8014 1.21233 1.96508
5 688 1400.59 781.044 4.91279 0.414377
6 747 225.079 116.201 1.26774 0.237676
7 243 3443.65 1855.47 10.3045 1.37526
In [174]:
label_avg_rfm.style.bar(color='#33caff')
Out[174]:
R F M Count
Label
0 4.46 5 5 276
1 2.02 2.98 2.9 912
2 2.77 4.55 4.51 633
3 3.8 1.74 1.77 666
4 1.89 1 1.23 146
5 4.46 3.76 3.81 688
6 1.37 1.37 1.37 747
7 5 4.21 4.33 243

8.1 Champion (Label 0)

Champion: Bought recently, buy often and spend the most!
These customers account for about 6% of the total, and their average spend per purchase is high; they should be rewarded, and they are willing to try new products. Combined with the earlier scatter plot, we can judge that this group contains some wholesalers.

8.2 Customer Needing Attention (Label 1)

Customers Needing Attention: Around average recency, frequency and monetary values. May not have bought very recently though.
Their proportion is 21.2%, the largest share of customers, but their purchase frequency and amounts are relatively low, indicating that the store is not attractive enough for them; a limited offer could re-engage them.

8.3 Can’t Lose Them (Label 2)

Can’t Lose Them: Spent big money and purchased often. But long time ago. Need to bring them back!
These customers make up 14.7% of the total but account for over 18% of total visits and total spending, so they are very important. Their spending amount and frequency used to be very high, but they have not appeared for a long time. Possible reasons: our products are no longer competitive, or other stores have drawn these users away. Provide newer and better products, study competitors' tactics, and win these customers back.

8.4 Promising (Label 3)

Promising: Recent shoppers, but haven’t spent much.
These customers make up 15.4% of the total. They visited recently but haven't spent much yet. We should build brand awareness and offer free trials.

8.5 Lost (Label 4)

Lost: Lowest recency, frequency and monetary scores.
This group is very small, only about 3% of customers. They visited long ago and all their RFM scores are very low, so they are low-value users. We can try to win them back through promotions; otherwise, give them up.

8.6 Potential Loyalist (Label 5)

Potential Loyalist: Recent customers, but spent a good amount and bought more than once.
These customers make up 16% of the total. Their average spending amount and frequency are notable, but the average amount per person is below the overall average (about £1,700). We should offer these customers membership or a loyalty reward program to improve their loyalty.

8.7 Hibernating (Label 6)

Hibernating: Last purchase was long back, low spenders and bought seldomly. These customers account for 17%, which is a lot, yet their RFM scores are very low. We can offer special discounts to attract them and rebuild brand value by presenting relevant products.

8.8 Loyal Customers (Label 7)

Loyal Customers: Spend good money with us often. Responsive to promotions.
These customers are a small share of the total (about 6%), but their average visits and spending are far above those of ordinary users; they deserve close attention.

In [189]:
ind_status_df['Label'] = labeled_rfm['Label'].astype('str')
ind_status_df.head()
Out[189]:
2009-12 2010-01 2010-02 2010-03 2010-04 2010-05 2010-06 2010-07 2010-08 2010-09 2010-10 2010-11 2010-12 Label
CustomerID
12346.00 Unknown Unknown Unknown New Inactive Inactive Return Inactive Inactive Inactive Inactive Inactive Inactive 4
12347.00 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown New Inactive Return 7
12348.00 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown New Inactive Inactive Inactive 1
12349.00 Unknown Unknown Unknown Unknown New Active Inactive Inactive Inactive Inactive Return Inactive Inactive 2
12351.00 Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown New Inactive 3
In [280]:
def get_status_rate(status):
    """
    For the given status, compute each label's share of that status in every month.
    """
    status_label_count = pd.DataFrame()
    for month in ind_status_df.columns[1:-1]:
        # counts of each (Label, status) pair for this month
        status_label_count[month] = ind_status_df.groupby('Label')[month].value_counts()
    label_specificStatus_df = pd.DataFrame()
    for month in status_label_count.T.index:
        # how many customers of each of the 8 labels have this status in this month
        label_counts = pd.Series([status_label_count.T.loc[month][(str(i), status)] for i in range(8)])
        label_specificStatus_df[month] = label_counts/label_counts.sum()
    return label_specificStatus_df
label_specificStatus_df = get_status_rate('Inactive')
label_specificStatus_df
Out[280]:
2010-01 2010-02 2010-03 2010-04 2010-05 2010-06 2010-07 2010-08 2010-09 2010-10 2010-11 2010-12
0 0.09 0.08 0.05 0.05 0.05 0.04 0.04 0.03 0.04 0.03 0.01 0.05
1 0.19 0.22 0.23 0.23 0.25 0.25 0.26 0.25 0.25 0.27 0.34 0.24
2 0.17 0.19 0.19 0.18 0.17 0.17 0.16 0.16 0.15 0.13 0.18 0.17
3 0.06 0.06 0.06 0.06 0.06 0.06 0.07 0.07 0.07 0.06 0.08 0.16
4 0.02 0.03 0.03 0.03 0.03 0.03 0.03 0.04 0.04 0.05 0.05 0.04
5 0.20 0.17 0.16 0.16 0.15 0.14 0.14 0.16 0.14 0.14 0.03 0.14
6 0.19 0.18 0.22 0.23 0.24 0.25 0.25 0.25 0.27 0.30 0.28 0.20
7 0.08 0.08 0.06 0.06 0.05 0.05 0.04 0.04 0.04 0.03 0.03 nan
In [281]:
labels = range(8)
x = label_specificStatus_df.columns.tolist()
plt.figure(figsize=(15, 5))
plt.stackplot(x, label_specificStatus_df.loc[0], label_specificStatus_df.loc[1], label_specificStatus_df.loc[2], label_specificStatus_df.loc[3], label_specificStatus_df.loc[4], label_specificStatus_df.loc[5], label_specificStatus_df.loc[6], label_specificStatus_df.loc[7], labels=labels)
plt.legend(loc = 'upper left')
plt.xlabel('Month')
plt.ylabel('Share')
plt.title('Inactive Customers by Label')
Out[281]:
Text(0.5, 1.0, 'Inactive Customers by Label')

According to this stackplot, three groups of customers, labels 1, 2 and 6, make up a high share of the inactive customers; we should focus on winning back these three segments.

9. Summary

  1. The average customer spends about £2,048; 75% of customers spend less than £1,723, and a small group of high-value wholesale buyers accounts for the long tail.
  2. The average interval between two purchases is 45 days, so new customers should be "woken up" within 30-50 days.
  3. Excluding users who purchased only once, the average customer lifetime is 199 days (134 days across all customers). If new users can be guided into repeat purchases, there is roughly a 0.5-fold increase available; these customers are mainly label 3, Promising.
  4. In the stackplot, the shares of return/active/new customers are stable, but the inactive share grows rapidly. This shows that we are losing old customers and that new customers are not coming back either. A large share of mid-to-late-period customers never purchased again; their R scores fall roughly between 2 and 4. The most valuable among them are labels 1 and 2, which are numerous (35.9% together) and have relatively high F (2.98/4.55) and M (2.90/4.51) scores.
  5. Looking at the monthly composition of inactive customers by label, labels 2 and 3 stand out, together accounting for about 30% of consumers. Label 2 has high purchase frequency and spending but has been absent for a while: we can't lose them. Label 3 visited recently and has not yet spent much: valuable potential customers.
In [ ]: